A Dilated Inception Network for Visual Saliency Prediction
With the advent of deep convolutional neural networks (DCNNs), visual
saliency prediction has improved markedly. One promising direction for the
next improvement is to fully characterize the multi-scale saliency-influential
factors with a computationally efficient module in DCNN architectures. In this
work, we propose an end-to-end dilated inception network (DINet) for visual
saliency prediction. It captures multi-scale contextual features effectively
with very few extra parameters.
Instead of utilizing parallel standard convolutions with different kernel
sizes, as in the existing inception module, our proposed dilated inception
module (DIM) uses parallel dilated convolutions with different dilation rates,
which significantly reduces the computational load while enriching the
diversity of receptive fields in the feature maps. Moreover, the performance
of our saliency model is further improved by using a set of linear
normalization-based probability distribution distance metrics as loss
functions. This lets us formulate saliency prediction as a probability
distribution prediction task for global saliency inference, rather than as a
typical pixel-wise regression problem.
Experimental results on several challenging saliency benchmark datasets
demonstrate that our DINet with proposed loss functions can achieve
state-of-the-art performance with shorter inference time.
Comment: Accepted by IEEE Transactions on Multimedia. The source code is
available at https://github.com/ysyscool/DINe
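
To make the dilated-inception idea concrete, here is a minimal PyTorch sketch
of such a module; the class name, channel counts, and dilation rates are
illustrative assumptions, not the authors' DINet code:

```python
import torch
import torch.nn as nn

class DilatedInceptionBlock(nn.Module):
    """Parallel 3x3 convolutions that differ only in dilation rate."""
    def __init__(self, in_channels, branch_channels, rates=(1, 2, 4, 8)):
        super().__init__()
        # padding = rate keeps the spatial size unchanged for a 3x3 kernel,
        # while each branch sees a different receptive field.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                      padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        # Concatenate the multi-scale branch outputs along the channel axis.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# A 64-channel feature map keeps its 32x32 resolution while gaining four
# receptive-field scales at the parameter cost of plain 3x3 kernels.
feats = torch.randn(1, 64, 32, 32)
out = DilatedInceptionBlock(64, 16)(feats)
print(out.shape)  # torch.Size([1, 64, 32, 32])
```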
Towards Robust Curve Text Detection with Conditional Spatial Expansion
It is challenging to detect curve texts due to their irregular shapes and
varying sizes. In this paper, we first investigate the deficiencies of
existing curve text detection methods and then propose a novel Conditional
Spatial Expansion (CSE) mechanism to improve the performance of curve text
detection. Instead of regarding curve text detection as a polygon regression
or segmentation problem, we treat it as a region expansion process. Our CSE
starts with a seed arbitrarily initialized within a text region and
progressively merges neighboring regions based on local features extracted by
a CNN and the contextual information of already-merged regions. The CSE is
highly parameterized and
can be seamlessly integrated into existing object detection frameworks.
Enhanced by the data-dependent CSE mechanism, our curve text detection system
provides robust instance-level text region extraction with minimal
post-processing. Analysis experiments show that our CSE can handle texts
with various shapes, sizes, and orientations, and can effectively suppress
false positives arising from text-like textures or unintended texts included
in the same RoI. Compared with existing curve text detection algorithms, our
method is more robust and enjoys a simpler processing flow. It also sets a
new state-of-the-art on curve text benchmarks, with an F-score of up to 78.4.
Comment: This paper has been accepted by the IEEE International Conference on
Computer Vision and Pattern Recognition (CVPR 2019).
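
The region-expansion idea can be illustrated with a toy region-growing loop.
The sketch below replaces the learned CNN features and merge policy with a
plain score map and a running-mean threshold, so it only mimics the control
flow, not the paper's CSE module:

```python
import numpy as np
from collections import deque

def expand_region(score, seed, tol=0.3):
    """Grow a region from `seed`, merging 4-neighbors whose score is
    compatible with the mean score of the region grown so far."""
    h, w = score.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    total, count = float(score[seed]), 1
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                # Merge decision conditioned on the merged region's statistics.
                if abs(float(score[ny, nx]) - total / count) < tol:
                    mask[ny, nx] = True
                    total += float(score[ny, nx])
                    count += 1
                    queue.append((ny, nx))
    return mask

score = np.zeros((8, 8))
score[2:6, 1:7] = 1.0                      # a bright, text-like band
print(expand_region(score, (3, 3)).sum())  # 24: the whole band is recovered
```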
SelfReformer: Self-Refined Network with Transformer for Salient Object Detection
The global and local contexts significantly contribute to the integrity of
predictions in Salient Object Detection (SOD). Unfortunately, existing methods
still struggle to generate complete predictions with fine details. There are
two major problems in conventional approaches: first, for global context,
high-level CNN-based encoder features cannot effectively capture long-range
dependencies, resulting in incomplete predictions. Second, downsampling the
ground truth to fit the size of the predictions introduces inaccuracy, because
ground-truth details are lost during interpolation or pooling. Thus, in this
work, we develop a Transformer-based network and frame a supervised task for a
branch to learn the global context information explicitly. In addition, we
adopt Pixel Shuffle from super-resolution (SR) to reshape the predictions back
to the size of the ground truth, instead of the reverse, so that details in
the ground truth remain untouched. We also develop a two-stage Context
Refinement Module (CRM) to fuse the global context and to automatically locate
and refine the local details in the predictions. The proposed network can
guide and correct itself based on the global and local context it generates,
and is thus named the Self-Refined Transformer (SelfReformer). Extensive
experiments and evaluation results on five benchmark datasets demonstrate the
outstanding performance of the network, which achieves state-of-the-art
results.
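
The Pixel Shuffle step can be sketched in a few lines of PyTorch. The channel
counts and upscale factor below are illustrative assumptions, not the paper's
configuration:

```python
import torch
import torch.nn as nn

r = 4  # assumed upscale factor between prediction and ground-truth size
head = nn.Sequential(
    # Map decoder features to r*r channels, one per output sub-pixel.
    nn.Conv2d(64, r * r, kernel_size=3, padding=1),
    # Rearrange (r*r, H, W) into (1, r*H, r*W) without interpolation,
    # so the full-resolution ground truth never has to be downsampled.
    nn.PixelShuffle(r),
)

pred_lowres = torch.randn(1, 64, 56, 56)  # low-resolution decoder output
saliency = head(pred_lowres)
print(saliency.shape)  # torch.Size([1, 1, 224, 224]) -- ground-truth size
```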
KSS-ICP: Point Cloud Registration based on Kendall Shape Space
Point cloud registration is a popular topic that is widely used in 3D model
reconstruction, localization, and retrieval. In this paper, we propose a new
registration method, KSS-ICP, to address the rigid registration task in Kendall
shape space (KSS) with Iterative Closest Point (ICP). The KSS is a quotient
space that removes influences of translations, scales, and rotations for shape
feature-based analysis. These influences amount to the similarity
transformations, which do not change the underlying shape. The point cloud
representation in KSS is therefore invariant to similarity transformations,
and we exploit this property to design KSS-ICP for point cloud registration.
To sidestep the difficulty of computing the KSS representation in general, the
proposed KSS-ICP formulates a practical solution that requires no complex
feature analysis, training data, or optimization. Despite its simple
implementation, KSS-ICP achieves more accurate registration of point clouds.
It is robust to similarity transformations, non-uniform density, noise, and
defective parts. Experiments show that KSS-ICP outperforms the state of the
art.
Comment: 13 pages, 20 figures
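
For intuition, here is a minimal NumPy sketch of the Kendall pre-shape
normalization plus a single Procrustes rotation step for point clouds with
known correspondences; the full KSS-ICP pipeline, which re-estimates
correspondences iteratively, is not reproduced here:

```python
import numpy as np

def preshape(points):
    # Kendall pre-shape: centering removes translation,
    # unit Frobenius norm removes scale.
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered)

def best_rotation(src, dst):
    # Orthogonal Procrustes: the rotation R minimizing ||dst - src @ R.T||.
    u, _, vt = np.linalg.svd(dst.T @ src)
    d = np.sign(np.linalg.det(u @ vt))  # guard against reflections
    return u @ np.diag([1.0, 1.0, d]) @ vt

rng = np.random.default_rng(0)
model = rng.normal(size=(100, 3))
theta = np.pi / 5
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
# Apply an arbitrary similarity transformation (scale, rotation, translation).
scene = 2.5 * model @ rot.T + np.array([4.0, -1.0, 2.0])

a, b = preshape(scene), preshape(model)  # both clouds mapped to pre-shapes
r_est = best_rotation(b, a)              # remove the remaining rotation
print(np.allclose(b @ r_est.T, a))       # True: identical shapes in KSS
```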
An Iterative Co-Saliency Framework for RGBD Images
As a newly emerging and significant topic in the computer vision community,
co-saliency detection aims at discovering the common salient objects in
multiple related images. Existing methods often generate the co-saliency map
through a direct forward pipeline based on designed cues or an initialization,
but lack a refinement-cycle scheme. Moreover, they mainly focus on RGB images
and ignore the depth information in RGBD images. In this paper, we propose an
iterative RGBD co-saliency framework, which utilizes existing single-image
saliency maps as the initialization and generates the final RGBD co-saliency
map using a refinement-cycle model. Three schemes are
employed in the proposed RGBD co-saliency framework, which include the addition
scheme, the deletion scheme, and the iteration scheme. The addition scheme
highlights salient regions based on intra-image depth propagation and saliency
propagation, while the deletion scheme filters the salient regions and removes
non-common salient regions based on an inter-image constraint. The iteration
scheme is proposed to obtain a more homogeneous and consistent co-saliency
map. Furthermore, a novel descriptor, named the depth shape prior, is proposed
in the addition scheme to introduce depth information and enhance the
identification of co-salient objects. The proposed method can effectively
exploit any existing 2D saliency model to work well in RGBD co-saliency
scenarios. Experiments on two RGBD co-saliency datasets demonstrate the
effectiveness of our proposed framework.
Comment: 13 pages, 13 figures. Accepted by IEEE Transactions on Cybernetics
2017. Project URL: https://rmcong.github.io/proj_RGBD_cosal_tcyb.htm
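
A toy sketch of the refinement cycle is given below. The actual addition and
deletion schemes rely on depth propagation and inter-image constraints; here
they are replaced by simple hand-made priors purely to illustrate the
iterate-until-fixed-point structure:

```python
import numpy as np

def refine(saliency, depth_prior, common_score, iters=10, eps=1e-4):
    s = saliency.copy()
    for _ in range(iters):
        # Addition: boost regions supported by the (toy) depth prior.
        boosted = np.clip(s + 0.5 * depth_prior * (1.0 - s), 0.0, 1.0)
        # Deletion: suppress regions that are not common across images.
        pruned = np.where(common_score > 0.5, boosted, 0.0)
        # Iteration: stop once the map reaches a fixed point.
        if np.abs(pruned - s).max() < eps:
            return pruned
        s = pruned
    return s

sal = np.array([[0.3, 0.8], [0.6, 0.1]])  # initial single-image saliency
dep = np.array([[0.9, 0.9], [0.1, 0.1]])  # toy intra-image depth support
com = np.array([[0.9, 0.9], [0.2, 0.9]])  # toy inter-image commonness
print(refine(sal, dep, com))              # common, depth-supported cells grow
```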
Blind Multimodal Quality Assessment of Low-light Images
Blind image quality assessment (BIQA) aims at automatically and accurately
forecasting objective scores for visual signals, which has been widely used to
monitor product and service quality in low-light applications, covering
smartphone photography, video surveillance, autonomous driving, etc. Recent
developments in this field are dominated by unimodal solutions, which are
inconsistent with human subjective rating patterns: human visual perception
reflects multiple sources of sensory information simultaneously. In this
article, we
present a unique blind multimodal quality assessment (BMQA) of low-light images
from subjective evaluation to objective score. To investigate the multimodal
mechanism, we first establish a multimodal low-light image quality (MLIQ)
database with authentic low-light distortions, containing image-text modality
pairs. Further, we specially design the key modules of BMQA, considering
multimodal quality representation, latent feature alignment and fusion, and
hybrid self-supervised and supervised learning. Extensive experiments show that
our BMQA yields state-of-the-art accuracy on the proposed MLIQ benchmark
database. In particular, we also build an independent single-image-modality
Dark-4K database, which is used to verify BMQA's applicability and
generalization performance in mainstream unimodal applications. Qualitative
and quantitative results on Dark-4K show that BMQA achieves performance
superior to existing BIQA approaches, as long as a pre-trained model is
provided to generate the text descriptions. The proposed framework and the two
databases, as well as the collected BIQA methods and evaluation metrics, are
made publicly available.
Comment: 15 pages
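
As a rough illustration of multimodal quality regression, the PyTorch sketch
below projects image and text features into a shared latent space, fuses
them, and regresses a single score; the dimensions and architecture are
assumptions, not the paper's BMQA design:

```python
import torch
import torch.nn as nn

class TinyBMQA(nn.Module):
    """Project image and text features to a shared space, fuse, regress."""
    def __init__(self, img_dim=512, txt_dim=768, latent_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, latent_dim)  # align image modality
        self.txt_proj = nn.Linear(txt_dim, latent_dim)  # align text modality
        self.head = nn.Sequential(
            nn.Linear(2 * latent_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, img_feat, txt_feat):
        # Fuse the aligned latent features and predict one score per sample.
        z = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)],
                      dim=-1)
        return self.head(z).squeeze(-1)

model = TinyBMQA()
scores = model(torch.randn(4, 512), torch.randn(4, 768))
print(scores.shape)  # torch.Size([4]): one quality score per sample
```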